
feat(crawler): queue-backed manager revalidation fan-out#4210

Merged
bokelley merged 2 commits into main from claude/issue-4200-manager-revalidation-queue on May 8, 2026
Conversation

Contributor

@bokelley bokelley commented May 8, 2026

Closes #4200 item 2. When a manager rotates its `adagents.json`, every publisher delegating via ads.txt `MANAGERDOMAIN` needs re-validation so its `authorized_agents` view stays in sync. Inline fan-out at managed-network scale (Raptive ≈ 6K publishers) would saturate crawler concurrency, so this PR adds a persistent queue and a bounded worker tick.

What lands

  • Migration 471: `manager_revalidation_queue` table mirroring the shape of `catalog_crawl_queue` (migration 367): idempotent insert, `next_attempt_after` for backoff, partial index for the worker's due-row scan, second index on `manager_domain` for ops-side per-manager status.
  • `cacheAdagentsManifest` change-detect: reads the previously-cached body before the upsert and compares the contributory subset (`authorized_agents`, `properties`) via recursive stable-key canonicalization. Only actual content drift triggers the fan-out — `$schema` / `last_updated` / trailing-comment noise is intentionally ignored.
  • `enqueueManagerRevalidation` walks `publishers WHERE manager_domain = $1` via the partial index added in #4204 (feat(registry): persist managerdomain discovery provenance on publisher rows), so a Raptive-scale rotation enumerates 6K delegating publishers via an index-only scan. Insert is idempotent and resets attempts/backoff on re-enqueue so a fresh manager change supersedes any in-flight retry window.
  • `processManagerRevalidationQueue` worker tick drains up to 50 rows per 5-minute interval at concurrency 10. Success deletes the row; failure advances exponential backoff (1h / 6h / 1d / 3d) and stores `last_error` truncated to 500 chars.
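
The change-detect comparison in `cacheAdagentsManifest` can be sketched roughly like this (a minimal TypeScript reconstruction from the bullet above; `canonicalize` and `manifestContentChanged` are illustrative names, and the PR's actual field handling may differ):

```typescript
// Hypothetical sketch of recursive stable-key canonicalization plus
// contributory-subset comparison. Only authorized_agents / properties
// participate; $schema, last_updated, and key-order churn are ignored.
type Json = null | boolean | number | string | Json[] | { [k: string]: Json };

function canonicalize(value: Json): Json {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    const out: { [k: string]: Json } = {};
    // Sorting keys makes JSON.stringify output order-independent.
    for (const key of Object.keys(value).sort()) out[key] = canonicalize(value[key]);
    return out;
  }
  return value;
}

function manifestContentChanged(prev: Json | null, next: { [k: string]: Json }): boolean {
  // No (or malformed) previous body: treat as changed so fan-out runs.
  if (prev === null || typeof prev !== "object" || Array.isArray(prev)) return true;
  // Compare only the contributory subset; missing fields default to [].
  const contributory = (m: { [k: string]: Json }): string =>
    JSON.stringify(canonicalize({
      authorized_agents: m["authorized_agents"] ?? [],
      properties: m["properties"] ?? [],
    }));
  return contributory(prev) !== contributory(next);
}
```

Because keys are sorted recursively before serialization, key-order churn and ignored top-level fields never register as drift, while reordering `authorized_agents` entries does (arrays are compared positionally).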

At those bounds (50 rows per 5-minute tick), a 6K-publisher manager rotation propagates within ~10 hours, comfortably ahead of the 60-minute organic re-crawl cadence that any single row would catch on its own.
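
The ~10-hour figure follows directly from the batch arithmetic (a back-of-envelope check, not code from the PR; concurrency 10 affects latency within a tick, not the tick cadence):

```typescript
// Worst-case drain time for a fan-out of `rows` queue rows, given the
// worker bounds stated above. Hypothetical helper for illustration.
function drainHours(rows: number, batchSize = 50, tickMinutes = 5): number {
  const ticks = Math.ceil(rows / batchSize); // each tick drains one batch
  return (ticks * tickMinutes) / 60;
}
// 6000 publishers: ceil(6000 / 50) = 120 ticks * 5 min = 600 min = 10 h.
```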

Tests

  • Integration (`manager-revalidation-queue.test.ts`): queue idempotency, attempts/backoff reset on re-enqueue, due-row filtering, batch limit, oldest-first ordering, success deletion, geometric backoff (1h → 6h), `last_error` truncation.
  • Unit (`manifest-content-changed.test.ts`): null previous, identical manifests, `$schema`/`last_updated` noise ignored, `authorized_agents`/`properties` change detected, order-sensitivity on `authorized_agents`, missing fields treated as empty arrays.
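
The backoff schedule those tests exercise could be expressed as follows (hypothetical helper; the PR's actual implementation may represent the intervals differently):

```typescript
// Geometric backoff per the schedule stated above: attempt 1 -> 1h,
// 2 -> 6h, 3 -> 1d, 4 and beyond -> 3d (capped at the last step).
const BACKOFF_HOURS = [1, 6, 24, 72];

function nextAttemptAfter(attempts: number, now: Date): Date {
  // Clamp into [1, schedule length] so retries past the table stay at 3d.
  const idx = Math.min(Math.max(attempts, 1), BACKOFF_HOURS.length) - 1;
  return new Date(now.getTime() + BACKOFF_HOURS[idx] * 3_600_000);
}
```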

Out of scope

  • Item 5 (`/api/registry/managers/:domain/recrawl` endpoint): now possible because the queue + reverse-lookup are in place. Thin wrapper that calls `enqueueManagerRevalidation` and returns the count. Separate PR.
  • The `source` enum extension to `adagents_json_via_manager` for per-agent rows in `agent_property_authorizations` (still tracked as a follow-up under #4200, "AAO crawler/API: persist managerdomain discovery provenance and reverse index").

Refs #4200, #4173, #4204.

Three follow-ups from code review of #4210:

1. Wire startPeriodicManagerRevalidation(5) into http.ts bootstrap next
   to startPeriodicCatalogCrawl. Without this the worker tick never
   fires and the queue fills but never drains. Critical fix.

2. Tighten the inline comment in cacheAdagentsManifest to call out
   that the enqueue is intentionally outside the upsert transaction:
   on enqueue failure the cache write is already committed, but the
   next 60-min routine crawl re-detects drift and re-enqueues, so
   silent fan-out loss self-heals.

3. Clarify enqueueManagerRevalidation's docstring — the returned count
   is rows touched (inserts + ON CONFLICT updates), since superseding
   a stale row counts toward the load-bearing semantic of "this fresh
   manager change overrides any in-flight backoff."
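
For item 3, the "rows touched" count falls out of the upsert shape. A hypothetical sketch of the enqueue statement, with the conflict key and column names assumed rather than taken from the PR:

```typescript
// Illustrative shape of the idempotent enqueue: ON CONFLICT resets
// attempts/backoff so a fresh manager change supersedes any in-flight
// retry window, and the driver's affected-row count therefore covers
// both new inserts and superseded rows.
const ENQUEUE_SQL = `
  INSERT INTO manager_revalidation_queue
    (publisher_id, manager_domain, attempts, next_attempt_after)
  SELECT id, manager_domain, 0, now()
  FROM publishers
  WHERE manager_domain = $1
  ON CONFLICT (publisher_id) DO UPDATE
    SET attempts = 0, next_attempt_after = now(), last_error = NULL
`;
```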
@bokelley bokelley merged commit 2e1de36 into main May 8, 2026
13 checks passed
@bokelley bokelley deleted the claude/issue-4200-manager-revalidation-queue branch May 8, 2026 19:00
bokelley added a commit that referenced this pull request May 8, 2026
Closes #4200 item 5. New POST /api/registry/manager-revalidation-request short-circuits the 60-minute organic crawl cycle: when a manager rotates its adagents.json, ops can hit this endpoint and have every delegating publisher enqueued for immediate re-validation.

Thin wrapper around enqueueManagerRevalidation (#4210). Body: { manager_domain }. Returns 202 with publishers_enqueued. Rate-limited via the shared validateAndRateLimitCrawl machinery; key namespaced (manager: prefix) so a manager request doesn't bypass an in-window publisher recrawl on the same domain.

Per-agent source enum extension to 'adagents_json_via_manager' is closed as won't-fix on #4200: publishers.discovery_method (#4204) already lets consumers join through and discriminate, and a separate per-agent value would silently exclude managerdomain rows from existing readers filtering on source='adagents_json'.

Refs #4200, #4173, #4204, #4210.
